{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Manually matching authors\n",
    "\n",
    "This is not a well-documented or illuminating notebook. It's a side thread that I pursued in order to standardize author names, using a combination of coding and elbow grease.\n",
    "\n",
    "Basically the idea was to find pairs of names that *might* match, and then scan through the list manually confirming or rejecting the pairings.\n",
    "\n",
    "Considering every possible pair would be exhausting and impossible. To make the task manageable, I sorted pairs by a score that combined the closeness of the names and the sheer *number* of volumes affected. The formula was\n",
    "\n",
    "    num_of_vols_by_A * num_of_vols_by_B * AB_match^2\n",
    "\n",
    "I only considered the top hundred pairs or so. The manually confirmed list of pairs is preserved in **manual_author_matches.tsv**, where the first column is an alias and the second column is the name to be preferred. Note that I did all of this before realizing that everything needed to pass through [Unicode normalization](https://stackoverflow.com/questions/16467479/normalizing-unicode). In practice, whenever I use **manual_author_matches.tsv** I also apply that normalization."
   ]
  },
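  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Unicode normalization mentioned above is not carried out anywhere in this notebook. As a rough sketch of what that step could look like, the cell below assumes NFC as the target form and uses a hypothetical helper, `normalize_name`, that is not part of the original workflow. It compares two spellings of the same name that look identical on screen but are stored as different character sequences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import unicodedata\n",
    "\n",
    "def normalize_name(name, form='NFC'):\n",
    "    # Hypothetical helper (not in the original workflow): map a name to a\n",
    "    # single Unicode form, NFC by assumption, and collapse extra whitespace.\n",
    "    return ' '.join(unicodedata.normalize(form, name).split())\n",
    "\n",
    "composed = 'Balzac, Honor\\u00e9 de'      # precomposed e-acute\n",
    "decomposed = 'Balzac, Honore\\u0301 de'   # plain e + combining acute accent\n",
    "\n",
    "# Raw comparison, then comparison after normalizing both spellings\n",
    "print(composed == decomposed)\n",
    "print(normalize_name(composed) == normalize_name(decomposed))"
   ]
  },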
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from difflib import SequenceMatcher\n",
    "from collections import Counter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "meta = pd.read_csv('../manifestationmeta.tsv', sep = '\\t', low_memory = False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Count how many volumes carry each author string; rows with no author are left out.\n",
    "names = Counter(meta['author'].dropna())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "720\n"
     ]
    }
   ],
   "source": [
    "blocks = dict()\n",
    "for n in names.keys():\n",
    "    if pd.isnull(n) or len(n) < 2:\n",
    "        continue\n",
    "    else:\n",
    "        block_code = n[0:2]\n",
    "        if block_code not in blocks:\n",
    "            blocks[block_code] = []\n",
    "        blocks[block_code].append(n)\n",
    "\n",
    "print(len(blocks))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n",
      "11\n",
      "21\n",
      "31\n",
      "41\n",
      "51\n",
      "61\n",
      "71\n",
      "81\n",
      "91\n",
      "101\n",
      "111\n",
      "121\n",
      "131\n",
      "141\n",
      "151\n",
      "161\n",
      "171\n",
      "181\n",
      "191\n",
      "201\n",
      "211\n",
      "221\n",
      "231\n",
      "241\n",
      "251\n",
      "261\n",
      "271\n",
      "281\n",
      "291\n",
      "301\n",
      "311\n",
      "321\n",
      "331\n",
      "341\n",
      "351\n",
      "361\n",
      "371\n",
      "381\n",
      "391\n",
      "401\n",
      "411\n",
      "421\n",
      "431\n",
      "441\n",
      "451\n",
      "461\n",
      "471\n",
      "481\n",
      "491\n",
      "501\n",
      "511\n",
      "521\n",
      "531\n",
      "541\n",
      "551\n",
      "561\n",
      "571\n",
      "581\n",
      "591\n",
      "601\n",
      "611\n",
      "621\n",
      "631\n",
      "641\n",
      "651\n",
      "661\n",
      "671\n",
      "681\n",
      "691\n",
      "701\n",
      "711\n"
     ]
    }
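   ],
   "source": [
    "def probablymatch(str1, str2):\n",
    "    # real_quick_ratio() is a cheap upper bound on ratio(), so the\n",
    "    # expensive ratio() is only computed for pairs that could exceed 0.8.\n",
    "    m = SequenceMatcher(None, str1, str2)\n",
    "    match = m.real_quick_ratio()\n",
    "    if match > 0.8:\n",
    "        match = m.ratio()\n",
    "    \n",
    "    return match\n",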
    "\n",
    "def get_simtuples(blocks, names):\n",
    "    \n",
    "    simtuples = []\n",
    "\n",
    "    ctr = 0\n",
    "    for code, block in blocks.items():\n",
    "        ctr += 1\n",
    "        if ctr % 10 == 1:\n",
    "            print(ctr)\n",
    "            \n",
    "        for n1 in block:\n",
    "            ct1 = names[n1]\n",
    "\n",
    "            for n2 in block:\n",
    "                \n",
    "                ct2 = names[n2]\n",
    "                if n1 == n2:\n",
    "                    continue\n",
    "\n",
    "                match = probablymatch(n1, n2)\n",
    "                if match > 0.85:\n",
    "                    similarity = match * match * ct1 * ct2\n",
    "                    atuple = (similarity, match, ct1, ct2, n1, n2)\n",
    "                    simtuples.append(atuple)\n",
    "    \n",
    "    return simtuples\n",
    "\n",
    "simtuples = get_simtuples(blocks, names)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "simtuples = sorted(simtuples, reverse = True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(117571.91836734694,\n",
       "  0.9142857142857143,\n",
       "  485,\n",
       "  290,\n",
       "  'Balzac, Honoré de',\n",
       "  'Balzac, Honoré de'),\n",
       " (117571.91836734692,\n",
       "  0.9142857142857143,\n",
       "  290,\n",
       "  485,\n",
       "  'Balzac, Honoré de',\n",
       "  'Balzac, Honoré de'),\n",
       " (78174.81481481482,\n",
       "  0.8888888888888888,\n",
       "  485,\n",
       "  204,\n",
       "  'Balzac, Honoré de',\n",
       "  'Balzac, Honor?? de'),\n",
       " (78174.8148148148,\n",
       "  0.8888888888888888,\n",
       "  204,\n",
       "  485,\n",
       "  'Balzac, Honor?? de',\n",
       "  'Balzac, Honoré de'),\n",
       " (49452.93061224489,\n",
       "  0.9142857142857143,\n",
       "  290,\n",
       "  204,\n",
       "  'Balzac, Honoré de',\n",
       "  'Balzac, Honor?? de'),\n",
       " (49452.93061224489,\n",
       "  0.9142857142857143,\n",
       "  204,\n",
       "  290,\n",
       "  'Balzac, Honor?? de',\n",
       "  'Balzac, Honoré de'),\n",
       " (17675.89777777778,\n",
       "  0.8666666666666667,\n",
       "  101,\n",
       "  233,\n",
       "  'King, Charles',\n",
       "  'Kingsley, Charles'),\n",
       " (17675.897777777776,\n",
       "  0.8666666666666667,\n",
       "  233,\n",
       "  101,\n",
       "  'Kingsley, Charles',\n",
       "  'King, Charles'),\n",
       " (15152.972972972975,\n",
       "  0.918918918918919,\n",
       "  485,\n",
       "  37,\n",
       "  'Balzac, Honoré de',\n",
       "  'Balzac, Honore   de'),\n",
       " (15152.972972972973,\n",
       "  0.918918918918919,\n",
       "  37,\n",
       "  485,\n",
       "  'Balzac, Honore   de',\n",
       "  'Balzac, Honoré de'),\n",
       " (8478.024691358025,\n",
       "  0.8888888888888888,\n",
       "  290,\n",
       "  37,\n",
       "  'Balzac, Honoré de',\n",
       "  'Balzac, Honore   de'),\n",
       " (8478.024691358025,\n",
       "  0.8888888888888888,\n",
       "  37,\n",
       "  290,\n",
       "  'Balzac, Honore   de',\n",
       "  'Balzac, Honoré de'),\n",
       " (5645.837837837839,\n",
       "  0.8648648648648649,\n",
       "  37,\n",
       "  204,\n",
       "  'Balzac, Honore   de',\n",
       "  'Balzac, Honor?? de'),\n",
       " (5645.837837837838,\n",
       "  0.8648648648648649,\n",
       "  204,\n",
       "  37,\n",
       "  'Balzac, Honor?? de',\n",
       "  'Balzac, Honore   de'),\n",
       " (5052.768166089965,\n",
       "  0.8823529411764706,\n",
       "  1298,\n",
       "  5,\n",
       "  'Dickens, Charles',\n",
       "  'Dickinson, Charles'),\n",
       " (5052.768166089965,\n",
       "  0.8823529411764706,\n",
       "  5,\n",
       "  1298,\n",
       "  'Dickinson, Charles',\n",
       "  'Dickens, Charles'),\n",
       " (3853.2000000000003,\n",
       "  0.8666666666666667,\n",
       "  570,\n",
       "  9,\n",
       "  'Eliot, George',\n",
       "  'Elliott, George P'),\n",
       " (3853.2000000000003,\n",
       "  0.8666666666666667,\n",
       "  9,\n",
       "  570,\n",
       "  'Elliott, George P',\n",
       "  'Eliot, George'),\n",
       " (3683.8922448979592,\n",
       "  0.9142857142857143,\n",
       "  113,\n",
       "  39,\n",
       "  'Brontë, Charlotte',\n",
       "  'Brontë, Charlotte'),\n",
       " (3683.8922448979592,\n",
       "  0.9142857142857143,\n",
       "  39,\n",
       "  113,\n",
       "  'Brontë, Charlotte',\n",
       "  'Brontë, Charlotte')]"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "simtuples[0: 20]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('simtuples.tsv', mode = 'w', encoding = 'utf-8') as f:\n",
    "    for s in simtuples:\n",
    "        stringlist = [str(x) for x in s]\n",
    "        line = '\\t'.join(stringlist) + '\\n'\n",
    "        f.write(line)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Balzac, Honoré de 485 | Balzac, Honoré de 290\n"
     ]
     }
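   ],
   "source": [
    "# Manual review loop: show each candidate pair with its volume counts\n",
    "# and record a decision typed at the prompt.\n",
    "#   '1'    -> keep n1; n2 becomes an alias of n1\n",
    "#   '2'    -> keep n2; n1 becomes an alias of n2\n",
    "#   'stop' -> append the confirmed aliases to the tsv file and quit\n",
    "#   other text longer than two characters -> preferred name for both\n",
    "#   anything else (e.g. just Enter) -> reject the pairing\n",
    "matched_already = set()\n",
    "equivalents = dict()\n",
    "for s in simtuples:\n",
    "    similarity, match, ct1, ct2, n1, n2 = s\n",
    "    # each pair appears twice, as (n1, n2) and (n2, n1); review it once\n",
    "    if (n2, n1) in matched_already:\n",
    "        continue\n",
    "    else:\n",
    "        print(n1 + ' ' + str(ct1) + ' | ' + n2 + ' ' + str(ct2))\n",
    "        response = input('?')\n",
    "        if response == '1':\n",
    "            equivalents[n2] = n1\n",
    "        elif response == '2':\n",
    "            equivalents[n1] = n2\n",
    "        elif response == 'stop':\n",
    "            with open('manual_author_matches.tsv', mode = 'a', encoding = 'utf-8') as f:\n",
    "                f.write('alias\\trealname\\n')\n",
    "                for k, v in equivalents.items():\n",
    "                    f.write(k + '\\t' + v + '\\n')\n",
    "            break\n",
    "        elif len(response) > 2:\n",
    "            equivalents[n1] = response\n",
    "            equivalents[n2] = response\n",
    "        \n",
    "        matched_already.add((n1, n2))"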
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "661\n",
      "1\n",
      "11\n",
      "21\n",
      "31\n",
      "41\n",
      "51\n",
      "61\n",
      "71\n",
      "81\n",
      "91\n",
      "101\n",
      "111\n",
      "121\n",
      "131\n",
      "141\n",
      "151\n",
      "161\n",
      "171\n",
      "181\n",
      "191\n",
      "201\n",
      "211\n",
      "221\n",
      "231\n",
      "241\n",
      "251\n",
      "261\n",
      "271\n",
      "281\n",
      "291\n",
      "301\n",
      "311\n",
      "321\n",
      "331\n",
      "341\n",
      "351\n",
      "361\n",
      "371\n",
      "381\n",
      "391\n",
      "401\n",
      "411\n",
      "421\n",
      "431\n",
      "441\n",
      "451\n",
      "461\n",
      "471\n",
      "481\n",
      "491\n",
      "501\n",
      "511\n",
      "521\n",
      "531\n",
      "541\n",
      "551\n",
      "561\n",
      "571\n",
      "581\n",
      "591\n",
      "601\n",
      "611\n",
      "621\n",
      "631\n",
      "641\n",
      "651\n",
      "661\n",
      "671\n",
      "681\n",
      "691\n",
      "701\n",
      "711\n"
     ]
    }
   ],
   "source": [
    "# What about cases where the first word throws us off\n",
    "# by putting variants of one author name in different blocks?\n",
    "# We also need a block list keyed on the second word.\n",
    "\n",
    "secondblocks = dict()\n",
    "\n",
    "for n in names.keys():\n",
    "    if pd.isnull(n):\n",
    "        continue\n",
    "    \n",
    "    name = n.replace(',', ' ')\n",
    "    # in case things are glued by a comma\n",
    "    \n",
    "    words = name.split()\n",
    "    if len(words) < 2:\n",
    "        continue\n",
    "    if len(words[1]) < 2:\n",
    "        continue\n",
    "    else:\n",
    "        block_code = words[1][0:2]\n",
    "        if block_code not in secondblocks:\n",
    "            secondblocks[block_code] = []\n",
    "        secondblocks[block_code].append(n)\n",
    "\n",
    "print(len(secondblocks))\n",
    "\n",
    "def get_misaligned_simtuples(blocks, secondblocks, names):\n",
    "    \n",
    "    simtuples = []\n",
    "\n",
    "    ctr = 0\n",
    "    for code, block in blocks.items():\n",
    "        ctr += 1\n",
    "        if ctr % 10 == 1:\n",
    "            print(ctr)\n",
    "            \n",
    "        if code in secondblocks:\n",
    "            secondblock = secondblocks[code]\n",
    "        else:\n",
    "            continue\n",
    "            \n",
    "        for n1 in block:\n",
    "            ct1 = names[n1]\n",
    "\n",
    "            for n2 in secondblock:\n",
    "                \n",
    "                ct2 = names[n2]\n",
    "                if n1 == n2:\n",
    "                    continue\n",
    "\n",
    "                match = probablymatch(n1, n2)\n",
    "                if match > 0.9:\n",
    "                    # note higher threshold here\n",
    "                    # matches are going to be rare\n",
    "                    \n",
    "                    similarity = match * match * ct1 * ct2\n",
    "                    atuple = (similarity, match, ct1, ct2, n1, n2)\n",
    "                    simtuples.append(atuple)\n",
    "    \n",
    "    return simtuples\n",
    "\n",
    "misaligned_simtuples = get_misaligned_simtuples(blocks, secondblocks, names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dafoe, Daniel 1 | Defoe, Daniel 765\n",
      "?2\n",
      "Greene, Graham 148 | Green, Graham 1\n",
      "?\n",
      "Bjørnson, Bjørnstjerne 76 | Bjr̜nson, Bjr̜nstjerne 1\n",
      "?1\n",
      "Herbert, Henry William 17 | Weber, Henry William 3\n",
      "?\n",
      "Bj??rnson, Bj??rnstjerne 22 | Bjr??nson, Bjr??nstjerne 1\n",
      "?Bjørnson, Bjørnstjerne\n",
      "Bj??rnson, Bj??rnstjerne 22 | Bjr??nson, Bj?_rnstjerne 1\n",
      "?Bjørnson, Bjørnstjerne\n",
      "MacCarthy, Mary 1 | McCarthy, Mary 19\n",
      "?\n",
      "Brougham and Vaux, Henry Brougham 2 | Baron, Brougham and Vaux, Henry Brougham 6\n",
      "?1\n",
      "Williams, William 5 | Williams, William G 2\n",
      "?\n",
      "Williams, William 5 | Willis, William 2\n",
      "?\n",
      "[Thomas, Lida Larrimore (Turner) Mrs.] 1 | ] [Thomas, Lida Larrimore (Turner), Mrs 7\n",
      "?Thomas, Lida Larrimore (Turner), Mrs\n",
      "Lamington, Alexander Dundas Ross Wishart Cochrane-Baillie 6 | baron, Lamington, Alexander Dundas Ross Wishart Cochrane-Baillie 1\n",
      "?1\n",
      "Lamington, Alexander Dundas Ross Wishart Cochrane-Baillie 6 | Baron, Lamington, Alexander Dundas Ross Wishart Cochrane-Baillie 1\n",
      "?1\n",
      "Mahy, Margaret 3 | Mayo, Margaret 2\n",
      "?\n",
      "Villaseñor, Victor 3 | Villaseñor, Victor 2\n",
      "?1\n",
      "novelist. Greenwood, James 4 | the novelist. Greenwood, James 1\n",
      "?Greenwood, James\n",
      "Martínez, Max 3 | Martinez, Max 1\n",
      "?1\n",
      "Mayer, Martin 1 | Myers, Martin 3\n",
      "?\n",
      "Ĭovkov, Ĭordan 2 | Ĭokov, Ĭordan 1\n",
      "?1\n",
      "Martínez, Max 2 | Martinez, Max 1\n",
      "?1\n",
      "Vasilikos, Vasilēs 2 | Vasilikos, Vasilēs 1\n",
      "?1\n",
      "Jayakoḍi, Jayasēna 2 | Jayakoḍi, Jayasēna 1\n",
      "?1\n",
      "Matthews, Margaret Bryan Mrs 1 | [Matthews, Margaret Bryan, Mrs.,] 2\n",
      "?1\n",
      "Khammān Khonkhai 2 | Khammān Khonkhai 1\n",
      "?1\n",
      "Eliot, Elizabeth 2 | Elliot, Elisabeth 1\n",
      "?stop\n"
     ]
    }
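   ],
   "source": [
    "# Same manual review loop as before, now over the misaligned pairs.\n",
    "# No header row is written here because the file already has one.\n",
    "matched_already = set()\n",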
    "equivalents = dict()\n",
    "for s in misaligned_simtuples:\n",
    "    similarity, match, ct1, ct2, n1, n2 = s\n",
    "    if (n2, n1) in matched_already:\n",
    "        continue\n",
    "    else:\n",
    "        print(n1 + ' ' + str(ct1) + ' | ' + n2 + ' ' + str(ct2))\n",
    "        response = input('?')\n",
    "        if response == '1':\n",
    "            equivalents[n2] = n1\n",
    "        elif response == '2':\n",
    "            equivalents[n1] = n2\n",
    "        elif response == 'stop':\n",
    "            with open('manual_author_matches.tsv', mode = 'a', encoding = 'utf-8') as f:\n",
    "                for k, v in equivalents.items():\n",
    "                    f.write(k + '\\t' + v + '\\n')\n",
    "            break\n",
    "        elif len(response) > 2:\n",
    "            equivalents[n1] = response\n",
    "            equivalents[n2] = response\n",
    "        \n",
    "        matched_already.add((n1, n2))"
   ]
  },
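  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A sketch of how the confirmed aliases might be applied to the `author` column afterwards. The helper `canonical_author` and the toy alias dictionary are illustrative assumptions; in practice the dictionary would be read from **manual_author_matches.tsv**, and NFC is again an assumed choice of normalization form. The single alias used here is one of the pairings confirmed in the review session above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import unicodedata\n",
    "import pandas as pd\n",
    "\n",
    "def canonical_author(name, alias_map):\n",
    "    # Hypothetical helper: normalize to NFC (an assumption), then\n",
    "    # swap a known alias for its preferred name.\n",
    "    if pd.isnull(name):\n",
    "        return name\n",
    "    name = unicodedata.normalize('NFC', name)\n",
    "    return alias_map.get(name, name)\n",
    "\n",
    "# Toy stand-in for the alias -> realname pairs in manual_author_matches.tsv\n",
    "raw_aliases = {'Dafoe, Daniel': 'Defoe, Daniel'}\n",
    "\n",
    "# Normalize both columns so that lookups use the same Unicode form\n",
    "alias_map = {\n",
    "    unicodedata.normalize('NFC', k): unicodedata.normalize('NFC', v)\n",
    "    for k, v in raw_aliases.items()\n",
    "}\n",
    "\n",
    "toy_authors = pd.Series(['Dafoe, Daniel', 'Defoe, Daniel', None])\n",
    "print(toy_authors.map(lambda n: canonical_author(n, alias_map)).tolist())"
   ]
  },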
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "68"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(misaligned_simtuples)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "misaligned_simtuples.sort(reverse = True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(651.8343195266273,\n",
       "  0.9230769230769231,\n",
       "  1,\n",
       "  765,\n",
       "  'Dafoe, Daniel',\n",
       "  'Defoe, Daniel'),\n",
       " (137.24005486968449,\n",
       "  0.9629629629629629,\n",
       "  148,\n",
       "  1,\n",
       "  'Greene, Graham',\n",
       "  'Green, Graham'),\n",
       " (137.24005486968449,\n",
       "  0.9629629629629629,\n",
       "  1,\n",
       "  148,\n",
       "  'Green, Graham',\n",
       "  'Greene, Graham'),\n",
       " (62.809917355371894,\n",
       "  0.9090909090909091,\n",
       "  76,\n",
       "  1,\n",
       "  'Bjørnson, Bjørnstjerne',\n",
       "  'Bjr̜nson, Bjr̜nstjerne'),\n",
       " (62.809917355371894,\n",
       "  0.9090909090909091,\n",
       "  1,\n",
       "  76,\n",
       "  'Bjr̜nson, Bjr̜nstjerne',\n",
       "  'Bjørnson, Bjørnstjerne'),\n",
       " (41.74829931972789,\n",
       "  0.9047619047619048,\n",
       "  17,\n",
       "  3,\n",
       "  'Herbert, Henry William',\n",
       "  'Weber, Henry William'),\n",
       " (18.486111111111107,\n",
       "  0.9166666666666666,\n",
       "  22,\n",
       "  1,\n",
       "  'Bj??rnson, Bj??rnstjerne',\n",
       "  'Bjr??nson, Bjr??nstjerne'),\n",
       " (18.486111111111107,\n",
       "  0.9166666666666666,\n",
       "  22,\n",
       "  1,\n",
       "  'Bj??rnson, Bj??rnstjerne',\n",
       "  'Bjr??nson, Bj?_rnstjerne'),\n",
       " (18.486111111111107,\n",
       "  0.9166666666666666,\n",
       "  1,\n",
       "  22,\n",
       "  'Bjr??nson, Bjr??nstjerne',\n",
       "  'Bj??rnson, Bj??rnstjerne'),\n",
       " (18.486111111111107,\n",
       "  0.9166666666666666,\n",
       "  1,\n",
       "  22,\n",
       "  'Bjr??nson, Bj?_rnstjerne',\n",
       "  'Bj??rnson, Bj??rnstjerne'),\n",
       " (17.712247324613557,\n",
       "  0.9655172413793104,\n",
       "  1,\n",
       "  19,\n",
       "  'MacCarthy, Mary',\n",
       "  'McCarthy, Mary'),\n",
       " (9.80896978795271,\n",
       "  0.9041095890410958,\n",
       "  2,\n",
       "  6,\n",
       "  'Brougham and Vaux, Henry Brougham',\n",
       "  'Baron, Brougham and Vaux, Henry Brougham'),\n",
       " (8.919753086419753,\n",
       "  0.9444444444444444,\n",
       "  5,\n",
       "  2,\n",
       "  'Williams, William',\n",
       "  'Williams, William G'),\n",
       " (8.919753086419753,\n",
       "  0.9444444444444444,\n",
       "  2,\n",
       "  5,\n",
       "  'Williams, William G',\n",
       "  'Williams, William'),\n",
       " (8.7890625, 0.9375, 5, 2, 'Williams, William', 'Willis, William'),\n",
       " (8.7890625, 0.9375, 2, 5, 'Willis, William', 'Williams, William'),\n",
       " (6.120425029515938,\n",
       "  0.935064935064935,\n",
       "  1,\n",
       "  7,\n",
       "  '[Thomas, Lida Larrimore (Turner) Mrs.]',\n",
       "  '] [Thomas, Lida Larrimore (Turner), Mrs'),\n",
       " (5.325865719554675,\n",
       "  0.9421487603305785,\n",
       "  6,\n",
       "  1,\n",
       "  'Lamington, Alexander Dundas Ross Wishart Cochrane-Baillie',\n",
       "  'baron, Lamington, Alexander Dundas Ross Wishart Cochrane-Baillie'),\n",
       " (5.325865719554675,\n",
       "  0.9421487603305785,\n",
       "  6,\n",
       "  1,\n",
       "  'Lamington, Alexander Dundas Ross Wishart Cochrane-Baillie',\n",
       "  'Baron, Lamington, Alexander Dundas Ross Wishart Cochrane-Baillie'),\n",
       " (5.173469387755102,\n",
       "  0.9285714285714286,\n",
       "  3,\n",
       "  2,\n",
       "  'Mahy, Margaret',\n",
       "  'Mayo, Margaret'),\n",
       " (5.173469387755102,\n",
       "  0.9285714285714286,\n",
       "  2,\n",
       "  3,\n",
       "  'Mayo, Margaret',\n",
       "  'Mahy, Margaret'),\n",
       " (5.066471877282689,\n",
       "  0.918918918918919,\n",
       "  3,\n",
       "  2,\n",
       "  'Villaseñor, Victor',\n",
       "  'Villaseñor, Victor'),\n",
       " (5.066471877282689,\n",
       "  0.918918918918919,\n",
       "  2,\n",
       "  3,\n",
       "  'Villaseñor, Victor',\n",
       "  'Villaseñor, Victor'),\n",
       " (3.4489795918367347,\n",
       "  0.9285714285714286,\n",
       "  4,\n",
       "  1,\n",
       "  'novelist. Greenwood, James',\n",
       "  'the novelist. Greenwood, James'),\n",
       " (2.55621301775148,\n",
       "  0.9230769230769231,\n",
       "  3,\n",
       "  1,\n",
       "  'Martínez, Max',\n",
       "  'Martinez, Max'),\n",
       " (2.55621301775148,\n",
       "  0.9230769230769231,\n",
       "  1,\n",
       "  3,\n",
       "  'Mayer, Martin',\n",
       "  'Myers, Martin'),\n",
       " (2.55621301775148,\n",
       "  0.9230769230769231,\n",
       "  1,\n",
       "  3,\n",
       "  'Martinez, Max',\n",
       "  'Martínez, Max'),\n",
       " (1.8545953360768173,\n",
       "  0.9629629629629629,\n",
       "  2,\n",
       "  1,\n",
       "  'Ĭovkov, Ĭordan',\n",
       "  'Ĭokov, Ĭordan'),\n",
       " (1.8545953360768173,\n",
       "  0.9629629629629629,\n",
       "  2,\n",
       "  1,\n",
       "  'Martínez, Max',\n",
       "  'Martinez, Max'),\n",
       " (1.8545953360768173,\n",
       "  0.9629629629629629,\n",
       "  1,\n",
       "  2,\n",
       "  'Ĭokov, Ĭordan',\n",
       "  'Ĭovkov, Ĭordan'),\n",
       " (1.8545953360768173,\n",
       "  0.9629629629629629,\n",
       "  1,\n",
       "  2,\n",
       "  'Martinez, Max',\n",
       "  'Martínez, Max'),\n",
       " (1.6888239590942296,\n",
       "  0.918918918918919,\n",
       "  2,\n",
       "  1,\n",
       "  'Vasilikos, Vasilēs',\n",
       "  'Vasilikos, Vasilēs'),\n",
       " (1.6888239590942296,\n",
       "  0.918918918918919,\n",
       "  2,\n",
       "  1,\n",
       "  'Jayakoḍi, Jayasēna',\n",
       "  'Jayakoḍi, Jayasēna'),\n",
       " (1.6888239590942296,\n",
       "  0.918918918918919,\n",
       "  1,\n",
       "  2,\n",
       "  'Vasilikos, Vasilēs',\n",
       "  'Vasilikos, Vasilēs'),\n",
       " (1.6888239590942296,\n",
       "  0.918918918918919,\n",
       "  1,\n",
       "  2,\n",
       "  'Jayakoḍi, Jayasēna',\n",
       "  'Jayakoḍi, Jayasēna'),\n",
       " (1.6855683955925826,\n",
       "  0.9180327868852459,\n",
       "  1,\n",
       "  2,\n",
       "  'Matthews, Margaret Bryan Mrs',\n",
       "  '[Matthews, Margaret Bryan, Mrs.,]'),\n",
       " (1.652892561983471,\n",
       "  0.9090909090909091,\n",
       "  2,\n",
       "  1,\n",
       "  'Khammān Khonkhai',\n",
       "  'Khammān Khonkhai'),\n",
       " (1.652892561983471,\n",
       "  0.9090909090909091,\n",
       "  2,\n",
       "  1,\n",
       "  'Eliot, Elizabeth',\n",
       "  'Elliot, Elisabeth'),\n",
       " (1.652892561983471,\n",
       "  0.9090909090909091,\n",
       "  1,\n",
       "  2,\n",
       "  'Khammān Khonkhai',\n",
       "  'Khammān Khonkhai'),\n",
       " (1.652892561983471,\n",
       "  0.9090909090909091,\n",
       "  1,\n",
       "  2,\n",
       "  'Elliot, Elisabeth',\n",
       "  'Eliot, Elizabeth'),\n",
       " (1.6316337148803328,\n",
       "  0.9032258064516129,\n",
       "  1,\n",
       "  2,\n",
       "  'Winter, William',\n",
       "  'Painter, William'),\n",
       " (0.9245562130177515,\n",
       "  0.9615384615384616,\n",
       "  1,\n",
       "  1,\n",
       "  'Mackay, Margaret Elizabeth',\n",
       "  'MacKay, Margaret Elizabeth'),\n",
       " (0.9245562130177515,\n",
       "  0.9615384615384616,\n",
       "  1,\n",
       "  1,\n",
       "  'MacKay, Margaret Elizabeth',\n",
       "  'Mackay, Margaret Elizabeth'),\n",
       " (0.9025, 0.95, 1, 1, 'Halsey, Harlan Page', '[Halsey, Harlan Page]'),\n",
       " (0.8975069252077561,\n",
       "  0.9473684210526315,\n",
       "  1,\n",
       "  1,\n",
       "  'Jordon, John Alfred',\n",
       "  'Jordan, John Alfred'),\n",
       " (0.8975069252077561,\n",
       "  0.9473684210526315,\n",
       "  1,\n",
       "  1,\n",
       "  'Jordan, John Alfred',\n",
       "  'Jordon, John Alfred'),\n",
       " (0.87890625, 0.9375, 1, 1, 'Rogers, Robert L', 'Rogers, Robert C'),\n",
       " (0.87890625, 0.9375, 1, 1, 'Rogers, Robert L', 'Rogers, Robert B'),\n",
       " (0.87890625, 0.9375, 1, 1, 'Rogers, Robert C', 'Rogers, Robert L'),\n",
       " (0.87890625, 0.9375, 1, 1, 'Rogers, Robert C', 'Rogers, Robert B'),\n",
       " (0.87890625, 0.9375, 1, 1, 'Rogers, Robert B', 'Rogers, Robert L'),\n",
       " (0.87890625, 0.9375, 1, 1, 'Rogers, Robert B', 'Rogers, Robert C'),\n",
       " (0.8711111111111112,\n",
       "  0.9333333333333333,\n",
       "  1,\n",
       "  1,\n",
       "  'Pichard, Pierre',\n",
       "  'Piccard, Pierre'),\n",
       " (0.8711111111111112,\n",
       "  0.9333333333333333,\n",
       "  1,\n",
       "  1,\n",
       "  'Piccard, Pierre',\n",
       "  'Pichard, Pierre'),\n",
       " (0.8635853484338333,\n",
       "  0.9292929292929293,\n",
       "  1,\n",
       "  1,\n",
       "  'Brabourne, Edward Hugessen Knatchbull-Hugessen',\n",
       "  'baron, Brabourne, Edward Hugessen Knatchbull-Hugessen'),\n",
       " (0.8635853484338333,\n",
       "  0.9292929292929293,\n",
       "  1,\n",
       "  1,\n",
       "  'Brabourne, Edward Hugessen Knatchbull-Hugessen',\n",
       "  'Baron, Brabourne, Edward Hugessen Knatchbull-Hugessen'),\n",
       " (0.8622448979591837,\n",
       "  0.9285714285714286,\n",
       "  1,\n",
       "  1,\n",
       "  'Savage, Sarah',\n",
       "  '[Savage, Sarah]'),\n",
       " (0.8520710059171599,\n",
       "  0.9230769230769231,\n",
       "  1,\n",
       "  1,\n",
       "  'Martin, Maria',\n",
       "  'Marian, Maria'),\n",
       " (0.8520710059171599,\n",
       "  0.9230769230769231,\n",
       "  1,\n",
       "  1,\n",
       "  'Marian, Maria',\n",
       "  'Martin, Maria'),\n",
       " (0.8483379501385041,\n",
       "  0.9210526315789473,\n",
       "  1,\n",
       "  1,\n",
       "  'Troubetzkoy, Amelie (Rives) Chanler',\n",
       "  '1863- Troubetzkoy, Amelie (Rives) Chanler'),\n",
       " (0.8402777777777777,\n",
       "  0.9166666666666666,\n",
       "  1,\n",
       "  1,\n",
       "  'Bjr??nson, Bjr??nstjerne',\n",
       "  'Bjr??nson, Bj?_rnstjerne'),\n",
       " (0.8402777777777777,\n",
       "  0.9166666666666666,\n",
       "  1,\n",
       "  1,\n",
       "  'Bjr??nson, Bj?_rnstjerne',\n",
       "  'Bjr??nson, Bjr??nstjerne'),\n",
       " (0.8264462809917354,\n",
       "  0.9090909090909091,\n",
       "  1,\n",
       "  1,\n",
       "  'novelist Greenwood, James',\n",
       "  'the novelist. Greenwood, James'),\n",
       " (0.8212890625,\n",
       "  0.90625,\n",
       "  1,\n",
       "  1,\n",
       "  'Shaw, Flora L. (Flora Louisa)',\n",
       "  'lady. Shaw, Flora L. (Flora Louisa)'),\n",
       " (0.8185941043083901,\n",
       "  0.9047619047619048,\n",
       "  1,\n",
       "  1,\n",
       "  'Rāmasvāmi Raju, P. V',\n",
       "  'Ramaswami Raju, P. V'),\n",
       " (0.8185941043083901,\n",
       "  0.9047619047619048,\n",
       "  1,\n",
       "  1,\n",
       "  'Ramaswami Raju, P. V',\n",
       "  'Rāmasvāmi Raju, P. V'),\n",
       " (0.8158168574401664,\n",
       "  0.9032258064516129,\n",
       "  1,\n",
       "  1,\n",
       "  'Miller, Michelle',\n",
       "  'Miller, Michael'),\n",
       " (0.8158168574401664,\n",
       "  0.9032258064516129,\n",
       "  1,\n",
       "  1,\n",
       "  'Miller, Michael',\n",
       "  'Miller, Michelle')]"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "misaligned_simtuples"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
